Diamonds EDA

This is an exploration of the diamonds dataset, which is embedded in the ggplot library.

library(tidyverse)
## ── Attaching packages ───────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.7.2     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ──────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
summary(diamonds)
##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 
length(diamonds$carat)
## [1] 53940
length(diamonds)
## [1] 10

Quiz 1:

there are 53940 observations of 10 variables. There are three ordered factors (cut, color, clarity) with the best color being “D”

Quiz 2:

a price histogram

ggplot(data=diamonds,aes(x=price))+
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Quiz 3: describe the shape and center of the distribution

mean(diamonds$price)
## [1] 3932.8
median(diamonds$price)
## [1] 2401

The distribution has a long right tail with a mean of $3932.80 and a median of $2401.

Quiz 4: Diamond counts

How many diamonds cost less than $500? How many diamonds cost less than $250? How many diamnods cost $15,000 or more?

sum(diamonds$price<500)
## [1] 1729
sum(diamonds$price<250)
## [1] 0
sum(diamonds$price>=15000)
## [1] 1656

Explore the largest peak in the price histogram you created earlier.

Try limiting the x-axis, altering the bin width, and setting different breaks on the x-axis.

There won’t be a solution video for this question so go to the discussions to share your thoughts and discover what other people find.

You can save images by using the ggsave() command. ggsave() will save the last plot created. For example… qplot(x = price, data = diamonds) ggsave(‘priceHistogram.png’)

ggplot(data=diamonds,aes(x=price))+
  geom_histogram(binwidth=50)+
  coord_cartesian(xlim=c(250,2500))

## Break out histograms by cut

ggplot(data=diamonds,aes(x=price))+
  geom_histogram(binwidth=50)+
  facet_wrap(~cut)

Price by cut

Which has the highest priced diamond? Which has the lowest priced diamond? Which has the lowest median price?

by(diamonds$price,diamonds$cut,max)
## diamonds$cut: Fair
## [1] 18574
## -------------------------------------------------------- 
## diamonds$cut: Good
## [1] 18788
## -------------------------------------------------------- 
## diamonds$cut: Very Good
## [1] 18818
## -------------------------------------------------------- 
## diamonds$cut: Premium
## [1] 18823
## -------------------------------------------------------- 
## diamonds$cut: Ideal
## [1] 18806
by(diamonds$price,diamonds$cut,min)
## diamonds$cut: Fair
## [1] 337
## -------------------------------------------------------- 
## diamonds$cut: Good
## [1] 327
## -------------------------------------------------------- 
## diamonds$cut: Very Good
## [1] 336
## -------------------------------------------------------- 
## diamonds$cut: Premium
## [1] 326
## -------------------------------------------------------- 
## diamonds$cut: Ideal
## [1] 326
by(diamonds$price,diamonds$cut,median)
## diamonds$cut: Fair
## [1] 3282
## -------------------------------------------------------- 
## diamonds$cut: Good
## [1] 3050.5
## -------------------------------------------------------- 
## diamonds$cut: Very Good
## [1] 2648
## -------------------------------------------------------- 
## diamonds$cut: Premium
## [1] 3185
## -------------------------------------------------------- 
## diamonds$cut: Ideal
## [1] 1810

Which has the highest priced diamond? premium Which has the lowest priced diamond? Premium and Ideal Which has the lowest median price? Ideal

Scales for multiple histograms

In the last exercise, we looked at the summary statistics for diamond price by cut. If we look at the output table, the the median and quartiles are reasonably close to each other.

This means the distributions should be somewhat similar, but the histograms we created don’t show that.

The ‘Fair’ and ‘Good’ diamonds appear to have different distributions compared to the better cut diamonds. They seem somewhat uniform on the left with long tails on the right.

Let’s look in to this more.

Look up the documentation for facet_wrap in R Studio. Then, scroll back up and add a parameter to facet_wrap so that the y-axis in the histograms is not fixed. You want the y-axis to be different for each histogram.

ggplot(data=diamonds,aes(x=price))+
  geom_histogram(binwidth=50)+
  facet_wrap(~cut,scales='free')

Two variables

create histogram of price per carat faceted by cut

ggplot(data=diamonds,aes(x=price/carat))+
  geom_histogram()+
  scale_x_log10()+
  facet_wrap(~cut,scales='free')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Investigate the price of diamonds using box plots, numerical summaries, and one of the following categorical variables: cut, clarity, or color.

ggplot(data=diamonds,aes(x=color,y=price/carat))+
  geom_boxplot()

by(diamonds$price,diamonds$color,summary)
## diamonds$color: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     357     911    1838    3170    4214   18693 
## -------------------------------------------------------- 
## diamonds$color: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     882    1739    3077    4003   18731 
## -------------------------------------------------------- 
## diamonds$color: F
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     342     982    2344    3725    4868   18791 
## -------------------------------------------------------- 
## diamonds$color: G
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     354     931    2242    3999    6048   18818 
## -------------------------------------------------------- 
## diamonds$color: H
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     337     984    3460    4487    5980   18803 
## -------------------------------------------------------- 
## diamonds$color: I
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1120    3730    5092    7202   18823 
## -------------------------------------------------------- 
## diamonds$color: J
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     335    1860    4234    5324    7695   18710

Inter quartile range IQR

summary(subset(diamonds,color=='D')$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     357     911    1838    3170    4214   18693
summary(subset(diamonds,color=='J')$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     335    1860    4234    5324    7695   18710
IQR(subset(diamonds,color=='D')$price)
## [1] 3302.5
IQR(subset(diamonds,color=='J')$price)
## [1] 5834.5

Investigate the price per carat of diamonds across the different colors of diamonds using boxplots.

ggplot(data=diamonds,aes(x=price/carat))+
  geom_histogram()+
  scale_x_log10()+
  facet_wrap(~color,scales='free',nrow=3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

carat frequency

ggplot(data=diamonds,aes(x=carat))+
  geom_freqpoly(binwidth=.01)

ggplot(data=diamonds,aes(x=x,y=price))+
         geom_point()

What is going on? there is some exponential trend

with(diamonds,cor.test(x,price))
## 
##  Pearson's product-moment correlation
## 
## data:  x and price
## t = 440.16, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8825835 0.8862594
## sample estimates:
##       cor 
## 0.8844352
with(diamonds,cor.test(y,price))
## 
##  Pearson's product-moment correlation
## 
## data:  y and price
## t = 401.14, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8632867 0.8675241
## sample estimates:
##       cor 
## 0.8654209
with(diamonds,cor.test(z,price))
## 
##  Pearson's product-moment correlation
## 
## data:  z and price
## t = 393.6, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8590541 0.8634131
## sample estimates:
##       cor 
## 0.8612494
ggplot(data=diamonds,aes(x=depth,y=price))+
  geom_point(alpha=1/100)+
  scale_x_continuous(breaks=seq(2,80,2))

typical depth range is between 58 and 64.

with(diamonds,cor.test(depth,price))
## 
##  Pearson's product-moment correlation
## 
## data:  depth and price
## t = -2.473, df = 53938, p-value = 0.0134
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.019084756 -0.002208537
## sample estimates:
##        cor 
## -0.0106474
ggplot(data=subset(diamonds,price<quantile(price,.99) & carat<quantile(carat,.99)),
       aes(x=carat,y=price))+
  geom_point()

diamonds%>%
  mutate(volume=x*y*z)->diamonds
ggplot(data=diamonds,aes(x=volume,y=price))+
  geom_point()

How many diamonds have 0 volume??

sum(diamonds$volume==0)
## [1] 20

what is the correlation of price and volume, excluding diamonds with volume=0 or >=800

with(subset(diamonds,volume>0 & volume<=800),cor.test(price,volume))
## 
##  Pearson's product-moment correlation
## 
## data:  price and volume
## t = 559.19, df = 53915, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9222944 0.9247772
## sample estimates:
##       cor 
## 0.9235455

price vs volume

We encourage you to think about this next question and to post your thoughts in the discussion section.

Do you think this would be a useful model to estimate the price of diamonds? Why or why not?

ggplot(data=subset(diamonds,volume>0 & volume<=800),
       aes(x=volume,y=price))+
  geom_point(alpha=1/50)+
  geom_smooth()
## `geom_smooth()` using method = 'gam'

This does not seem like a great fit because it shows that at some point, diamonds get cheaper per volume, which is very false.

Use the function dplyr package to create a new data frame containing info on diamonds by clarity.

Name the data frame diamondsByClarity

diamondsByClarity<-diamonds%>%
  group_by(clarity)%>%
  summarise(mean_price=mean(price),
            median_price=median(price),
            min_price=min(price),
            max_price=max(price),
            n=n())
head(diamondsByClarity)
## # A tibble: 6 x 6
##   clarity mean_price median_price min_price max_price     n
##   <ord>        <dbl>        <dbl>     <dbl>     <dbl> <int>
## 1 I1            3924         3344       345     18531   741
## 2 SI2           5063         4072       326     18804  9194
## 3 SI1           3996         2822       326     18818 13065
## 4 VS2           3925         2054       334     18823 12258
## 5 VS1           3839         2005       327     18795  8171
## 6 VVS2          3284         1311       336     18768  5066
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
diamonds_by_clarity <- group_by(diamonds, clarity)
diamonds_mp_by_clarity <- summarise(diamonds_by_clarity, mean_price = mean(price))

diamonds_by_color <- group_by(diamonds, color)
diamonds_mp_by_color <- summarise(diamonds_by_color, mean_price = mean(price))

diamonds_mp_by_clarity
## # A tibble: 8 x 2
##   clarity mean_price
##   <ord>        <dbl>
## 1 I1            3924
## 2 SI2           5063
## 3 SI1           3996
## 4 VS2           3925
## 5 VS1           3839
## 6 VVS2          3284
## 7 VVS1          2523
## 8 IF            2865
dclar<-ggplot(data=diamonds_mp_by_clarity,aes(x=clarity,y=mean_price))+
  geom_col()

dcol<-ggplot(data=diamonds_mp_by_color,aes(x=color,y=mean_price))+
  geom_col()

grid.arrange(dclar,dcol,ncol=1)

Multiple variables

Create a histogram of diamond prices.

Facet the histogram by diamond color and use cut to color the histogram bars.

ggplot(data=diamonds,aes(x=price,fill=cut))+
  geom_histogram(bins=50)+
  scale_x_log10()+
  facet_wrap(~color)+
  scale_fill_brewer(type='qual')

Create a scatterplot of diamond price vs. table and color the points by the cut of the diamond.

ggplot(data=diamonds,aes(y=price,x=table,color=cut))+
  geom_point()+
  scale_color_brewer(type='qual')

What is the typical table range for diamonds of ideal cut?

53-57

What is the typical table range for diamonds of premium cut?

58-62

Create a scatterplot of diamond price vs. volume (x * y * z) and color the points by the clarity of diamonds.

Use scale on the y-axis to take the log10 of price. You should also omit the top 1% of diamond volumes from the plot.

ggplot(data=subset(diamonds,volume<quantile(volume,.99)&
                     volume>0),
       aes(x=volume,y=price,color=clarity))+
  geom_point()+
  scale_y_log10()+
  scale_color_brewer(type='div')

Create a scatter plot of the price/carat ratio of diamonds. The variable x should be assigned to cut. The points should be colored by diamond color, and the plot should be faceted by clarity.

ggplot(data=diamonds,aes(x=cut,y=price/carat,color=color))+
  geom_jitter(size=1)+
  facet_wrap(~clarity)+
  scale_color_brewer(type='div')